Abstract. In the statistical setting of the pattern recognition problem, the
number of examples required to approximate an unknown labelling function
is linear in the VC dimension of the target learning class. In this
work we consider the question of whether such bounds exist if we restrict
our attention to computable pattern recognition methods, assuming that
the unknown labelling function is also computable. We find that in this
case the number of examples required for a computable method to
approximate the labelling function not only is not linear, but grows faster
(in the VC dimension of the class) than any computable function. No
time or space constraints are put on the predictors or target functions;
the only resource we consider is the training examples.
The task of pattern recognition is considered in conjunction with another
learning problem, data compression. An impossibility result for the
task of data compression allows us to estimate the sample complexity
for pattern recognition.



1 Introduction 

The task of pattern recognition consists in predicting an unknown label of some 
observation (or object). For instance, the object can be an image of a hand- 
written letter, in which case the label is the actual letter represented by this 
image. Other examples include DNA sequence identification, recognition of an 
illness based on a set of symptoms, speech recognition, and many others. 

More formally, the objects are drawn independently from the object space
$X$ (usually $X = [0,1]^d$ or $\mathbb{R}^d$) according to some unknown but fixed
probability distribution $P$ on $X$, and labels are defined according to some
function $\eta : X \to Y$, where $Y$ is a finite set (often $Y = \{0,1\}$). The task
is to construct a function $\varphi : \{0,1\}^* \to Y$ which approximates $\eta$, i.e.
for which $P\{x : \eta(x) \ne \varphi(x)\}$ is small, where $P$ and $\eta$ are unknown
but examples $x_1, y_1, \dots, x_n, y_n$ are given; $y_i := \eta(x_i)$. In the
framework of statistical learning theory [7], [?] it is assumed that the
function $\eta$ belongs to some known class of functions $C$. Good error
estimates can be obtained if the class $C$ is small enough. More formally, the
number of examples required to obtain a certain level of accuracy (or the sample
complexity of $C$) is linear in the VC-dimension of $C$.

In this work we investigate the question of whether such bounds can be obtained
if we consider only computable (on some Turing machine) pattern recognition
methods. To make the problem more realistic, we also assume that the target
function $\eta$ is computable. Both the predictors and the target functions are
of the form $\{0,1\}^\infty \to \{0,1\}$.

We show that there are classes $C_k$ of functions for which the number of
examples needed to approximate the pattern recognition problem to a certain
accuracy grows faster in the VC dimension of the class than any computable
function (rather than being linear as in the statistical setting). In particular,
this holds if $C_k$ is the class of all computable functions of length not greater
than $k$.

Importantly, the same negative result holds even if we allow the data to be 
generated "actively", e.g. by some algorithm, rather than just by some fixed 
probability distribution. 

To obtain this negative result we consider the task of data compression: an
impossibility result for the task of data compression allows us to estimate the
sample complexity for pattern recognition. We also analyze how tight the
negative result is, and show that for some simple computable rule (based on the
nearest neighbour estimate) the sample complexity is finite in $k$, under different
definitions of the computational pattern recognition task.

In comparison to the vast literature on pattern recognition and related
learning problems, relatively little attention has been paid to the "computable"
version of the task; at least this concerns the task of approximating any
computable function. There is a track of research in which different concepts of
computable learnability of functions on countable domains are studied, see [?].
A link between this framework and statistical learning theory is proposed in
[?], where it is argued that for uniform learnability finite VC dimension is
required.

Another approach is to consider pattern recognition methods as functions
computable in polynomial time, or under other resource constraints. This
approach leads to many interesting results, but it usually considers more
specific settings of a learning problem, such as learning DNFs, finite automata,
etc. See [3] for an introduction to this theory and for references.

2 Preliminaries 

A (binary) string is a member of the set $\{0,1\}^* = \bigcup_{i=0}^{\infty} \{0,1\}^i$.
The length of a string $x$ will be denoted by $|x|$, while $x^i$ is the $i$th element
of $x$, $1 \le i \le |x|$. For a set $A$ the symbol $|A|$ is used for the number of
elements in $A$. Let $\mathbb{N}$ be the set of natural numbers. We will assume the
lexicographical order on the set of strings, and when necessary will identify
$\{0,1\}^*$ and $\mathbb{N}$ via this ordering. The symbol $\log$ is used for $\log_2$.
For a real number $a$ the symbol $\lceil a \rceil$ is the least natural number not
smaller than $a$.
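The identification of $\{0,1\}^*$ with $\mathbb{N}$ can be made concrete. The
following is a minimal sketch, assuming that the ordering is length-first and
then lexicographic and that the empty string receives the number 0; both
conventions and all function names are assumptions of this example, not part
of the text.

```python
def string_to_nat(s: str) -> int:
    """Index of the binary string s in length-then-lexicographic order.

    Assumed convention: the empty string has index 0.
    """
    if any(ch not in "01" for ch in s):
        raise ValueError("expected a binary string")
    # Prepending '1' makes the map injective; subtracting 1 starts at 0.
    return int("1" + s, 2) - 1


def nat_to_string(n: int) -> str:
    """Inverse of string_to_nat."""
    if n < 0:
        raise ValueError("expected a natural number")
    # bin(n + 1) looks like '0b1...'; drop the prefix '0b' and the leading 1.
    return bin(n + 1)[3:]


def ceil_log2(m: int) -> int:
    """Least natural number not smaller than log2(m), for m >= 1."""
    if m < 1:
        raise ValueError("m must be positive")
    return (m - 1).bit_length()


first_strings = [nat_to_string(i) for i in range(7)]
```

Under these conventions the first seven strings are the empty string, 0, 1,
00, 01, 10 and 11, and the index of a string never exceeds twice the number
obtained by reading it in binary with a leading 1.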

In pattern recognition a labelling function is usually a function from the
interval $[0,1]$ or $[0,1]^d$ (sometimes more general spaces are considered) to a
finite space $Y := \{0,1\}$. As we are interested in computable functions, we
consider instead functions of the form $\{0,1\}^\infty \to \{0,1\}$. Moreover, we
call a partial recursive function (or program) $\eta$ a labelling function if there
exists $t =: t(\eta) \in \mathbb{N}$ such that $\eta$ accepts all strings from
$X_t := \{0,1\}^t$ and only such strings. For an introduction to computability
theory see for example [6].

It can be argued that this definition of a labelling function is too restrictive
to approximate well the notion of a real function. However, as we are after
negative results (for the class of all labelling functions), this is not a
disadvantage. Other possible definitions are discussed in Section 4, where we
are concerned with the tightness of our negative results.

All computable functions can be encoded (in a canonical way), and thus the
set of computable functions can be effectively enumerated. Define the length of
$\eta$ as $l(\eta) := |n|$, where $n$ is the minimal number of $\eta$ in such an
enumeration.

Define the task of computational pattern recognition as follows. An (unknown)
labelling function $\eta$ is fixed. The objects $x_1, \dots, x_n \in X$ are drawn
according to some distribution $P$ on $X_{t(\eta)}$. The labels $y_i$ are defined
according to $\eta$, that is $y_i := \eta(x_i)$.

A predictor is a family of functions (indexed by $n$)

$$\varphi_n(x_1, y_1, \dots, x_n, y_n, x),$$

taking values in $Y$, such that for any $n$ and any $t \in \mathbb{N}$, if
$x_i \in X_t$ for each $i$, $1 \le i \le n$, then the marginal $\varphi(x)$ is a total
recursive function on $X_t$ (that is, $\varphi_n(x)$ accepts any $x \in X_t$). We
will often identify $\varphi_n$ with its marginal $\varphi_n(x)$ when the values of
other variables are clear.

Thus, given a sample $x_1, y_1, \dots, x_n, y_n$ of labelled objects of the same
size $t$, a predictor produces a computable function; this function is supposed
to approximate the labelling function $\eta$ on $X_t$.

A computable predictor is a predictor which for any $t \in \mathbb{N}$ and any
$n \in \mathbb{N}$ is a total recursive function on
$X_t \times Y \times \dots \times X_t \times Y \times X_t$.

3 Main results 

We are interested in what size sample is required to approximate a labelling
function $\eta$. More precisely, for a (computable) predictor $\varphi$, a labelling
function $\eta$ and $0 < \varepsilon \in \mathbb{R}$ define

$$\delta_n(\varphi, \eta, \varepsilon) := \sup_{P_t} P_t\{x_1, \dots, x_n \in X_t : P_t\{x \in X_t : \varphi_n(x_1, y_1, \dots, x_n, y_n, x) \ne \eta(x)\} > \varepsilon\},$$

where $t = t(\eta)$ and $P_t$ ranges over all distributions on $X_t$. For
$\delta \in \mathbb{R}$, $\delta > 0$ define the sample complexity of $\eta$ with respect
to $\varphi$ as

$$N(\varphi, \eta, \delta, \varepsilon) := \min\{n \in \mathbb{N} : \delta_n(\varphi, \eta, \varepsilon) \le \delta\}.$$

The number $N(\varphi, \eta, \delta, \varepsilon)$ is the minimal sample size required for
a predictor $\varphi$ to achieve $\varepsilon$-accuracy with probability $1 - \delta$ when
the (unknown) labelling function is $\eta$.

1 It is not essential for the definition of a labelling function that $\eta$ is
not a total function. An equivalent (for our purposes) definition would be as
follows. A labelling function is any total function which outputs the string 00
on all inputs except on the strings of some length $t =: t(\eta)$, on each of
which it outputs either 0 or 1.

We can use statistical learning theory [7] to derive the following statement.

Proposition 1. There exists a predictor $\varphi$ such that

$$N(\varphi, \eta, \delta, \varepsilon) \le \max\left( \frac{8\, l(\eta)}{\varepsilon} \log\frac{13}{\varepsilon},\ \frac{4}{\varepsilon} \log\frac{2}{\delta} \right)$$

for any labelling function $\eta$ and any $\varepsilon, \delta > 0$.

Observe that the bound is linear in the length of $\eta$.
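To get a feeling for the size of such a bound, it can be evaluated
numerically. The sketch below assumes the zero-error VC bound in the form
$\max\big(\frac{8V}{\varepsilon}\log\frac{13}{\varepsilon}, \frac{4}{\varepsilon}\log\frac{2}{\delta}\big)$
with $\log = \log_2$ and with $V$ standing for $l(\eta)$; the function name and
the rounding up to an integer are choices of this example.

```python
import math


def zero_error_sample_bound(vc_dim: float, eps: float, delta: float) -> int:
    """Evaluate max((8V/eps) log2(13/eps), (4/eps) log2(2/delta)), rounded up.

    Assumption: this is the form of the bound used in Proposition 1,
    with V replaced by the length l(eta) of the labelling function.
    """
    if vc_dim < 0 or not 0 < eps < 1 or not 0 < delta < 1:
        raise ValueError("need vc_dim >= 0 and eps, delta in (0, 1)")
    vc_term = (8 * vc_dim / eps) * math.log2(13 / eps)
    confidence_term = (4 / eps) * math.log2(2 / delta)
    return math.ceil(max(vc_term, confidence_term))


bound_10 = zero_error_sample_bound(10, 0.05, 0.1)
bound_20 = zero_error_sample_bound(20, 0.05, 0.1)
```

For fixed accuracy and confidence, doubling the first argument roughly
doubles the value once the first term dominates, which is the linear
dependence on $l(\eta)$ noted above.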

In what follows, after the proof of this simple statement, we investigate the
question of whether any such bounds exist if we restrict our attention to
computable predictors.

Proof. The predictor $\varphi$ is defined as follows. For each sample
$x_1, y_1, \dots, x_n, y_n$ it finds a shortest program $\bar\eta$ such that
$\bar\eta(x_i) = y_i$ for all $i \le n$. Clearly, $l(\bar\eta) \le l(\eta)$. Observe
that the VC-dimension of the class of all functions of length not greater than
$l(\eta)$ is bounded from above by $l(\eta)$, as there are not more than
$2^{l(\eta)}$ such functions. Moreover, $\varphi$ minimises empirical risk over this
class of functions. It remains to use the following bound (see e.g. [?],
Corollary 12.4)

$$N(\varphi, \eta, \delta, \varepsilon) \le \max\left( \frac{8\, V(C)}{\varepsilon} \log\frac{13}{\varepsilon},\ \frac{4}{\varepsilon} \log\frac{2}{\delta} \right),$$

where $V(C)$ is the VC-dimension of the class $C$.



The main result of this work is that for any computable predictor $\varphi$ there
is no computable upper bound in terms of $l(\eta)$ on the sample complexity of the
function $\eta$ with respect to $\varphi$:

Theorem 1. For any computable predictor $\varphi$ and any total recursive function
$\beta : \mathbb{N} \to \mathbb{N}$ there exist a labelling function $\eta$ and some
$n > \beta(l(\eta))$ such that

$$P\{x \in X_{t(\eta)} : \varphi(x_1, y_1, \dots, x_n, y_n, x) \ne \eta(x)\} > 0.05$$

for any $x_1, \dots, x_n \in X_{t(\eta)}$, where $y_i = \eta(x_i)$ and $P$ is the uniform
distribution on $X_{t(\eta)}$.

For example, we can take $\beta(n) = 2^n$, or $2^{2^n}$.

Corollary 1. For any computable predictor $\varphi$, any total recursive function
$\beta : \mathbb{N} \to \mathbb{N}$ and any $\delta < 1$

$$\sup_{\eta : l(\eta) \le k} N(\varphi, \eta, \delta, 0.05) > \beta(k)$$

from some $k$ on.



Observe that there is no $\delta$ in the formulation of Theorem 1. Moreover, it
is not important how the objects $(x_1, \dots, x_n)$ are generated: it can be any
individual sample. In fact, we can assume that the sample is chosen in any
manner, for example by some algorithm. This means that no computable upper
bound on sample complexity exists even for active learning algorithms.

It appears that the task of pattern recognition is closely related to another
learning task, data compression. Moreover, to prove Theorem 1 we need a
similar negative result for this task. Thus before proceeding with the proof of
the theorem, we introduce the task of data compression and derive some negative
results for it. We call a total recursive function $\psi : \{0,1\}^* \to \{0,1\}^*$ a
data compressor if it is an injection (i.e. $x_1 \ne x_2$ implies
$\psi(x_1) \ne \psi(x_2)$). We say that a data compressor compresses the string $x$
if $|\psi(x)| < |x|$. Clearly, for any natural $n$ any data compressor compresses
not more than a half of the strings of size not greater than $n$.
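The counting behind the last claim is a pigeonhole argument: there are
$2^{n+1} - 1$ strings of length at most $n$, and a compressed one must be sent
injectively to one of the $2^n - 1$ strings of length at most $n - 1$. The
following sketch checks this on a toy injective map; `toy_compressor` is a
hypothetical example made up for the illustration, not a compressor from the
text.

```python
from itertools import product


def strings_up_to(n: int) -> list[str]:
    """All binary strings of length 0..n."""
    return ["".join(bits) for k in range(n + 1) for bits in product("01", repeat=k)]


def toy_compressor(x: str) -> str:
    """Hypothetical injection: shortens strings starting with '00' by one
    symbol and lengthens every other string by one symbol."""
    return "0" + x[2:] if x.startswith("00") else "1" + x


def compressed_fraction(psi, n: int) -> float:
    """Fraction of strings of length <= n that psi maps to a shorter string."""
    xs = strings_up_to(n)
    images = [psi(x) for x in xs]
    if len(set(images)) != len(xs):
        raise ValueError("psi is not injective on strings of length <= n")
    return sum(len(y) < len(x) for x, y in zip(xs, images)) / len(xs)


fraction_for_4 = compressed_fraction(toy_compressor, 4)
```

For this particular map 7 of the 31 strings of length at most 4 are
compressed; no injective map can exceed 15 of 31.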

We will now present a definition of Kolmogorov complexity; for fine details
see [?], [?]. The complexity of a string $x \in \{0,1\}^*$ with respect to a machine
$\zeta$ is defined as

$$C_\zeta(x) = \min_p \{ l(p) : \zeta(p) = x \},$$

where $p$ ranges over all partial functions (minimum over the empty set is defined
as $\infty$). There exists a machine $\zeta$ such that
$C_\zeta(x) \le C_{\zeta'}(x) + c_{\zeta'}$ for any $x$ and any machine $\zeta'$ (the
constant $c_{\zeta'}$ depends on $\zeta'$ but not on $x$). Fix any such $\zeta$ and
define the Kolmogorov complexity of a string $x \in \{0,1\}^*$ as

$$C(x) := C_\zeta(x).$$

Clearly, $C(x) \le |x| + b$ for any $x$ and for some $b$ depending only on $\zeta$. A
string is called $c$-incompressible if $C(x) \ge |x| - c$. Obviously, no data
compressor can compress many $c$-incompressible strings, for any $c$. However,
highly compressible strings (that is, strings with Kolmogorov complexity low
relative to their length) might be expected to be compressed well by some
sensible data compressor. The following lemma shows that this cannot always be
the case, no matter what we mean by "relatively low".
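Kolmogorov complexity itself is not computable, but any fixed lossless
compressor gives a computable upper bound on description length (up to the
additive constant needed to describe the decompressor). The sketch below uses
zlib purely as such a stand-in; it illustrates the contrast between a highly
regular string and a pseudo-random one and does not compute $C(x)$. The
variable names and the choice of test strings are assumptions of the example.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of data: a computable proxy
    that upper-bounds description length, not the value C(x)."""
    return 8 * len(zlib.compress(data, 9))


# A very regular string: the pattern '01' repeated 5000 times.
regular = b"01" * 5000

# A pseudo-random string of the same length (fixed seed for repeatability).
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(regular)))

regular_ratio = compressed_bits(regular) / (8 * len(regular))
noisy_ratio = compressed_bits(noisy) / (8 * len(noisy))
```

The regular string shrinks to a small fraction of its length, while the
pseudo-random one does not shrink at all. Lemma 1 below concerns the opposite
situation: strings of low complexity that a given compressor nevertheless
fails to compress.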

The proof of this lemma is followed by the proof of Theorem 1.

Lemma 1. For any data compressor $\psi$ and any total recursive function
$\gamma : \mathbb{N} \to \mathbb{N}$ such that $\gamma$ goes monotonically to infinity
there exists a binary string $x$ such that $C(x) \le \gamma(|x|)$ and
$|\psi(x)| \ge |x|$.

Proof. Suppose the contrary, i.e. that there exist a data compressor $\psi$ and
some function $\gamma : \mathbb{N} \to \mathbb{N}$ monotonically increasing to infinity
such that for any string $x$, if $C(x) \le \gamma(|x|)$ then $|\psi(x)| < |x|$. Let $T$
be the set of all strings which are not compressed by $\psi$:

$$T := \{x : |\psi(x)| \ge |x|\}.$$

Define the function $\tau$ on the set $T$ as follows: $\tau(x)$ is the number of the
element $x$ in $T$,

$$\tau(x) := \#\{x' \in T : x' \le x\}$$

for each $x \in T$. Obviously, the set $T$ is infinite. Moreover, $\tau(x) \le x$ for
any $x \in T$ (recall that we identify $\{0,1\}^*$ and $\mathbb{N}$ via lexicographical
ordering). Observe that $\tau$ is a total recursive function on $T$ and onto
$\mathbb{N}$. Thus $\tau^{-1} : \mathbb{N} \to \{0,1\}^*$ is a total recursive function on
$\mathbb{N}$. Thus, for any $x \in T$,

$$C(\tau(x)) \ge C(\tau^{-1}(\tau(x))) - c = C(x) - c > \gamma(|x|) - c, \qquad (1)$$

for a constant $c$ depending only on $\tau$, where the first inequality follows from
the computability of $\tau^{-1}$ and the last from the definition of $T$.

It is a well-known result (see e.g. [?], Theorem 2.3.1) that for any partial
function $\delta$ that goes monotonically to infinity there is $x \in \{0,1\}^*$ such
that $C(x) \le \delta(|x|)$. In particular, allowing $\delta(|x|) = \gamma(|x|) - 2c$, we
conclude that there exists $x \in T$ such that

$$C(\tau(x)) \le \gamma(|\tau(x)|) - 2c \le \gamma(|x|) - 2c,$$

which contradicts (1).

Proof of Theorem 1. Suppose the contrary, that is, that there exist a
computable predictor $\varphi$ and a total function $\beta : \mathbb{N} \to \mathbb{N}$ such
that for any labelling function $\eta$ and any $n > \beta(l(\eta))$ we have

$$P\{x : \varphi(x_1, y_1, \dots, x_n, y_n, x) \ne \eta(x)\} \le 0.05$$

for some $x_i \in X_{t(\eta)}$, $y_i = \eta(x_i)$, $i \in \mathbb{N}$, where $P$ is the
uniform distribution on $X_{t(\eta)}$.

Not restricting generality we can assume that $\beta$ is strictly increasing.
Define the (total) function $\beta^{-1}(n) := \max\{m \in \mathbb{N} : \beta(m) \le n\}$.
Define $\varepsilon := 0.05$. Construct the data compressor $\psi$ as follows. For each
$y \in \{0,1\}^*$ define $m := |y|$, $t := \lceil \log m \rceil$. Generate
(lexicographically) the first $m$ strings of length $t$ and denote them by $x_i$,
$1 \le i \le m$. Define the labelling function $\eta_y$ as follows: $t(\eta_y) = t$ and
$\eta_y(x_i) = y^i$, $1 \le i \le m$. Clearly, $C(\eta_y) \le C(y) + c$, where $c$ is
some universal constant capturing the above description.

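Let $n := \sqrt{m}$. Next we run the predictor $\varphi$ on all possible tuples
$\mathbf{x} = (x_1, \dots, x_n) \in X_t^n$ and each time count the errors that $\varphi$
makes on all elements of $X_t$:

$$E(\mathbf{x}) := \{x \in X_t : \varphi(x_1, y_1, \dots, x_n, y_n, x) \ne \eta_y(x)\}.$$

If $|E(\mathbf{x})| > \varepsilon m$ for each $\mathbf{x} \in X_t^n$ then $\psi(y) := 0y$.

Otherwise proceed as follows. Fix some tuple $\mathbf{x} = (x'_1, \dots, x'_n)$ such
that $|E(\mathbf{x})| \le \varepsilon m$, and let $H := \{x'_1, \dots, x'_n\}$ be the
unordered tuple $\mathbf{x}$. Define

$$\kappa^i := \begin{cases}
e_0 & x_i \in E(\mathbf{x}) \setminus H,\ y^i = 0 \\
e_1 & x_i \in E(\mathbf{x}) \setminus H,\ y^i = 1 \\
c_0 & x_i \in H,\ y^i = 0 \\
c_1 & x_i \in H,\ y^i = 1 \\
* & \text{otherwise}
\end{cases}$$

for $1 \le i \le m$. Thus, each $\kappa^i$ is a member of a five-letter alphabet (a
five-element set) $\{e_0, e_1, c_0, c_1, *\}$. Denote the string
$\kappa^1 \dots \kappa^m$ by $K$.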

Observe that the string $K$, the predictor $\varphi$ and the order of
$(x'_1, \dots, x'_n)$ (which is not contained in $K$) are sufficient to restore the
string $y$. Furthermore, the $n$-tuple $(x'_1, \dots, x'_n)$ can be obtained from $H$
(the unordered tuple) by the appropriate permutation; let $r$ be the number of
this permutation in some fixed ordering of all $n!$ such permutations. Using
Stirling's formula, we have $|r| \le 2n \log n$; moreover, to encode $r$ with some
self-delimiting code we need not more than $4n \log n$ symbols (for $n > 3$).
Denote such an encoding of $r$ by $\rho$.

Next, as there are at least $(1 - \varepsilon - \frac{1}{\sqrt{m}})m$ symbols $*$ in the
$m$-element string $K$, it can be encoded by some simple binary code $\sigma$ in
such a way that

$$|\sigma(K)| \le \tfrac{1}{2} m + 7(\varepsilon m + n). \qquad (2)$$

Indeed, construct $\sigma$ as follows. First replace all occurrences of the string
$**$ with 0. Encode the rest of the symbols with any fixed 4-bit encoding such
that the code of each letter starts with 1. Clearly, $\sigma(K)$ is uniquely
decodable. Moreover, it is easy to check that (2) is satisfied, as there are not
less than $\frac{1}{2}(m - 2(\varepsilon m + n))$ occurrences of the string $**$. We
also need to write $m$ in a self-delimiting way (denote it by $s$); clearly,
$|s| \le 2 \log m$.

Finally, $\psi(y) = 1 \rho s \sigma(K)$ and $|\psi(y)| < |y|$ for $m > 2^{10}$. Thus,
$\psi$ compresses any $y$ such that $n > \beta(C(\eta_y))$; since
$C(\eta_y) \le C(y) + c$ and $\beta$ is increasing, this holds in particular
whenever $\sqrt{m} > \beta(C(y) + c)$. This contradicts Lemma 1 with
$\gamma(k) := \beta^{-1}(\sqrt{k} - c)$. □
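The code $\sigma$ from the proof is simple enough to write down. If $p$ pairs
$**$ are replaced by 0 and each of the remaining $m - 2p$ symbols costs 4 bits,
the total length is $p + 4(m - 2p) = 4m - 7p$, which is where the constant 7 in
the bound on $|\sigma(K)|$ comes from. The sketch below is one possible
instance: the concrete 4-bit code words, the left-to-right pairing of stars,
the helper `random_K` and the additive slack of 3.5 bits in the checked bound
(which covers odd-length runs of stars) are all assumptions of this example.

```python
import random

# Assumed 4-bit code words; the proof only requires that each starts with 1.
CODES = {"*": "1000", "e0": "1001", "e1": "1010", "c0": "1011", "c1": "1100"}
DECODE = {bits: sym for sym, bits in CODES.items()}


def sigma_encode(K: list[str]) -> str:
    """Encode K: a pair '**' becomes '0', any other symbol a 4-bit word."""
    out, i = [], 0
    while i < len(K):
        if K[i] == "*" and i + 1 < len(K) and K[i + 1] == "*":
            out.append("0")
            i += 2
        else:
            out.append(CODES[K[i]])
            i += 1
    return "".join(out)


def sigma_decode(bits: str) -> list[str]:
    """Inverse of sigma_encode: the first bit of each code word decides."""
    K, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            K += ["*", "*"]
            i += 1
        else:
            K.append(DECODE[bits[i:i + 4]])
            i += 4
    return K


def random_K(m: int, p_non_star: float, seed: int) -> list[str]:
    """Hypothetical test input: m symbols, mostly '*'."""
    rng = random.Random(seed)
    letters = ["e0", "e1", "c0", "c1"]
    return [rng.choice(letters) if rng.random() < p_non_star else "*" for _ in range(m)]


example = ["*", "*", "e1", "*", "*", "*", "c0", "*", "*", "*", "*"]
example_code = sigma_encode(example)  # 4 pairs and 3 single symbols: 4 + 12 bits
```

Since a code word starting with 0 always stands for a pair of stars and every
other code word has exactly four bits, decoding never needs lookahead, which
is the unique decodability used in the proof.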

4 On tightness of the negative results 

In this section we discuss how tight the conditions of the statements are and to
what extent they depend on the definitions.

Let us consider the question of whether there exists any (not necessarily
computable) sample-complexity function

$$\mathcal{N}_\varphi(k, \delta, \varepsilon) := \sup_{\eta : l(\eta) \le k} N(\varphi, \eta, \delta, \varepsilon),$$

at least for some predictor $\varphi$, or whether it is always infinite from some $k$
on.

Proposition 2. There exists a predictor $\varphi$ such that
$\mathcal{N}_\varphi(k, \delta, \varepsilon) < \infty$ for any $\varepsilon, \delta > 0$ and any
$k \in \mathbb{N}$.

Proof. Clearly, $C(\eta) \ge C(t_\eta)$. Moreover, $\liminf_{t \to \infty} C(t) = \infty$
so that

$$\max\{t_\eta : l(\eta) \le k\} < \infty$$

for any $k$. It follows that the "pointwise" predictor

$$\varphi(x_1, y_1, \dots, x_n, y_n, x) = \begin{cases}
y_i & \text{if } x = x_i,\ 1 \le i \le n \\
0 & \text{if } x \notin \{x_1, \dots, x_n\}
\end{cases} \qquad (3)$$

satisfies the conditions of the proposition.
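The "pointwise" predictor simply memorises the sample. A minimal sketch
follows, with objects represented as Python strings over the alphabet 0, 1 and
the sample as a list of (object, label) pairs; this representation, the
helper `uniform_error` and the parity labelling used for the demonstration
are assumptions of the example.

```python
from itertools import product


def pointwise_predictor(sample: list[tuple[str, int]], x: str) -> int:
    """Return the stored label if x occurs in the sample, and 0 otherwise."""
    for x_i, y_i in sample:
        if x_i == x:
            return y_i
    return 0


def uniform_error(sample: list[tuple[str, int]], eta, t: int) -> float:
    """Fraction of X_t = {0,1}^t on which the predictor disagrees with eta."""
    domain = ["".join(bits) for bits in product("01", repeat=t)]
    wrong = sum(pointwise_predictor(sample, x) != eta(x) for x in domain)
    return wrong / len(domain)


def parity(x: str) -> int:
    """Hypothetical labelling function on X_3 used for the demonstration."""
    return x.count("1") % 2


full_sample = [("".join(b), parity("".join(b))) for b in product("01", repeat=3)]
```

Because $X_t$ is finite, a large enough sample eventually covers every object
of positive probability often enough, and on covered objects the predictor
makes no errors; this is why a finite (though not computable) sample
complexity exists.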



It can be argued that this statement is probably due to our definition of a
labelling function. Next we will discuss some other variants of this definition.

First, observe that if we define a labelling function as any total function on
$\{0,1\}^*$ then some labelling functions will not approximate any real function;
an example is the function $\eta_+$ which computes the bitwise sum of its input:
$\eta_+(x) := \sum_{i=1}^{|x|} x^i \bmod 2$. That is why we require a labelling
function to be defined only on $X_t$ for some $t$.

Another way to define a labelling function (which perhaps makes labelling
functions closest to real functions) is as a function which accepts any infinite
binary string. Let us call an i-labelling function any total recursive function
$\eta : \{0,1\}^\infty \to \{0,1\}$. That is, $\eta$ is computable on a Turing machine
with an input tape on which a one-way infinite input is written, an output tape
and possibly some working tapes. The program $\eta$ is required to halt on any
input. The next proposition shows that even if we consider such a definition the
situation does not change. The definition of a labelling function $\eta$ in which
it accepts only finite strings is chosen in order to stay within conventional
computability theory.

Lemma 2. For any i-labelling function $\eta$ there exists $n_\eta \in \mathbb{N}$ such
that $\eta$ does not scan its input tape further than position $n_\eta$. In
particular, $\eta(x) = \eta(x')$ as soon as $x_i = x'_i$ for any $i \le n_\eta$.

Proof. For any $\chi \in \{0,1\}^\infty$ the program $\eta$ does not scan its tape
further than some position $n(\chi)$ (otherwise $\eta$ does not halt on $\chi$). For
any $\chi \in \{0,1\}^\infty$ denote by $n_\eta(\chi)$ the maximal $n \in \mathbb{N}$ such
that $\eta$ scans the input tape up to the position $n$ on the input $\chi$.

Suppose that $\sup_{\chi \in \{0,1\}^\infty} n_\eta(\chi) = \infty$, i.e. that the
statement is false. Define $x_0$ to be the empty string. Furthermore, let

$$x_i := \begin{cases}
0 & \text{if } \sup_{\chi \in \{0,1\}^\infty} n_\eta(x_1 \dots x_{i-1} 0 \chi) = \infty \\
1 & \text{otherwise.}
\end{cases}$$

By our assumption, $x_i$ is defined for each $i \in \mathbb{N}$. Moreover, it is easy
to check that $\eta$ never stops on the input string $x_1 x_2 \dots$

Besides, it is easy to check that the number $n_\eta$ is computable.

Finally, it can be easily verified that Proposition 2 holds true if we consider
i-labelling functions instead of labelling functions, constructing the required
predictor based on the nearest neighbour predictor.

Proposition 3. There exists a predictor $\varphi$ such that
${}^{i}\mathcal{N}_\varphi(k, \delta, \varepsilon) < \infty$ for any $\varepsilon, \delta > 0$
and any $k \in \mathbb{N}$, where ${}^{i}\mathcal{N}$ is defined as $\mathcal{N}$ with
labelling functions replaced by i-labelling functions.

Proof. Indeed, it suffices to replace the "pointwise" predictor in the proof of
Proposition 2 by the following predictor $\varphi$, which assigns to the object $x$
the label of that object among $x_1, \dots, x_n$ with which $x$ has the longest
mutual prefix: $\varphi(x_1, y_1, \dots, x_n, y_n, x) := y_k$, where

$$k := \operatorname{argmax}_{1 \le m \le n} \max\{i \in \mathbb{N} : x^1 \dots x^i = x_m^1 \dots x_m^i\};$$

to avoid an infinite run in case of ties, $\varphi$ considers only the first (say)
$n$ digits of $x_i$ and breaks ties in favour of the lowest index.
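The longest-common-prefix rule can be sketched as follows. Infinite binary
strings are represented here by finite Python strings that are at least as
long as the number of digits inspected; this representation and the function
names are assumptions of the example. Only the first $n$ digits are compared,
where $n$ is the sample size, and ties go to the lowest index.

```python
def common_prefix_len(a: str, b: str, limit: int) -> int:
    """Length of the common prefix of a and b, looking at most `limit` digits."""
    k = 0
    for u, v in zip(a[:limit], b[:limit]):
        if u != v:
            break
        k += 1
    return k


def prefix_nn_predictor(sample: list[tuple[str, int]], x: str) -> int:
    """Label of the sample object sharing the longest prefix with x."""
    if not sample:
        raise ValueError("the sample must be non-empty")
    n = len(sample)  # inspect only the first n digits, as in the proof
    best_index, best_len = 0, -1
    for index, (x_m, _) in enumerate(sample):
        shared = common_prefix_len(x_m, x, n)
        if shared > best_len:  # strict comparison keeps the lowest index on ties
            best_index, best_len = index, shared
    return sample[best_index][1]


demo_sample = [("0000", 0), ("1100", 1), ("1010", 0)]
demo_label = prefix_nn_predictor(demo_sample, "1110")  # shares "11" with "1100"
```

Truncating the comparison at $n$ digits is what keeps the predictor total on
infinite inputs: the loop always terminates, even when the query coincides
with a sample object on every digit.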